fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646) #1692
Force-pushed from 5e4bdb2 to ae78cbe.
The full suite is green: all eval test cases (calculator-evals, simulation-testcase, tools-evals, etc.) pass across alpha/cloud/staging; lint, SonarCloud, and build all pass.
logger = logging.getLogger(__name__)

def _inline_defs(
Can we have separate classes for how we handle Anthropic, OpenAI, Gemini, etc.?
Done in a8a3307 — split into per-provider strategy classes: OpenAIStructuredOutput (response_format preferred, tool fallback), AnthropicStructuredOutput and GeminiStructuredOutput (straight to a forced tool call, since their response_format is broken on the gateway — live-verified: Claude returns prose, Gemini empty content), with a shared ToolCallStructuredOutput base. Also avoids the wasted known-dead request per simulation for non-OpenAI models. E2E results on all 3 models are in the updated PR description.
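As an illustration of the split described in this reply, here is a minimal sketch of per-provider strategy classes. Only the four class names come from the PR; the constructor arguments, the `generate` method, the `select_strategy` helper, and matching on the model name are assumptions made for this example and may not match the real module.

```python
from typing import Any, Callable

# Hypothetical stand-in: a callable that performs one gateway request and
# returns the parsed structured output, or raises ValueError if unusable.
Request = Callable[[dict[str, Any]], dict[str, Any]]


class ToolCallStructuredOutput:
    """Shared base: always goes through a forced tool call."""

    def __init__(self, via_response_format: Request, via_tool_call: Request):
        self.via_response_format = via_response_format
        self.via_tool_call = via_tool_call

    def generate(self, schema: dict[str, Any]) -> dict[str, Any]:
        return self.via_tool_call(schema)


class AnthropicStructuredOutput(ToolCallStructuredOutput):
    """Claude: straight to the forced tool call (inherited behavior)."""


class GeminiStructuredOutput(ToolCallStructuredOutput):
    """Gemini: straight to the forced tool call (inherited behavior)."""


class OpenAIStructuredOutput(ToolCallStructuredOutput):
    """OpenAI: response_format first, forced tool call as fallback."""

    def generate(self, schema: dict[str, Any]) -> dict[str, Any]:
        try:
            return self.via_response_format(schema)
        except ValueError:
            return self.via_tool_call(schema)


def select_strategy(model: str) -> type[ToolCallStructuredOutput]:
    """Pick a strategy class from the model name (assumed heuristic)."""
    name = model.lower()
    if "claude" in name or "anthropic" in name:
        return AnthropicStructuredOutput
    if "gemini" in name:
        return GeminiStructuredOutput
    # Assumption: unknown models reuse the response_format-first chain.
    return OpenAIStructuredOutput
```

With this shape, a Claude or Gemini model never issues the `response_format` request at all, which is the "no wasted request" property the reply mentions.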
mjnovice
left a comment
Minor comment about making `generate_structured_output` more modular.
Force-pushed from b4954be to 5bddc0f.
Pull request overview
This PR addresses eval tool/input simulation failures for non-OpenAI models by introducing a provider-agnostic structured-output helper and updating the eval mockers and LLM gateway integration to support function-calling style structured responses (including nested schemas via raw-dict tool passthrough).
Changes:
- Add `generate_structured_output()` helper to prefer `response_format` when available and fall back to a forced tool call when content is empty/unsupported.
- Update eval LLM mocker and input mocker to use the shared structured-output helper, and adjust/extend unit tests for the new behavior (including non-OpenAI fallback).
- Update `UiPathLlmChatService.chat_completions()` to accept raw-dict tools (pass-through) so nested JSON-schema tool parameters are preserved; bump package versions accordingly.
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/uipath/uv.lock | Bumps uipath and uipath-platform locked versions. |
| packages/uipath/pyproject.toml | Bumps uipath version and raises minimum uipath-platform dependency. |
| packages/uipath-platform/uv.lock | Bumps uipath-platform locked version. |
| packages/uipath-platform/pyproject.toml | Bumps uipath-platform version. |
| packages/uipath-platform/src/uipath/platform/chat/_llm_gateway_service.py | Allows passing raw-dict tools through to the normalized gateway request body. |
| packages/uipath-platform/tests/services/test_uipath_llm_integration.py | Adds coverage ensuring raw-dict tool schemas are forwarded unchanged and tool_choice serialization works. |
| packages/uipath/src/uipath/eval/mocks/_structured_output.py | New shared helper for structured output with response_format-first + tool-call fallback; includes schema $defs inlining logic. |
| packages/uipath/src/uipath/eval/mocks/_llm_mocker.py | Switches mock response generation to the shared structured-output helper; improves error propagation. |
| packages/uipath/src/uipath/eval/mocks/_input_mocker.py | Switches input generation to the shared structured-output helper. |
| packages/uipath/tests/cli/eval/mocks/test_structured_output.py | New unit tests validating schema wrapping/inlining and response extraction/fallback behavior. |
| packages/uipath/tests/cli/eval/mocks/test_mocks.py | Updates existing mocks and adds a non-OpenAI fallback regression test (AE-1646). |
| packages/uipath/tests/cli/eval/mocks/test_input_mocker.py | Adds assertion that OpenAI path uses response_format without tool fallback. |
Comments suppressed due to low confidence (1)
packages/uipath/src/uipath/eval/mocks/_input_mocker.py:158
`generate_llm_input()` no longer surfaces a clear JSON-parsing error when the LLM returns invalid JSON (the previous code raised `UiPathInputMockingError("Failed to parse LLM response as JSON: ...")`). Now a `json.JSONDecodeError` from `generate_structured_output()` is wrapped as a generic "Failed to generate input" error, which makes debugging structured-output failures harder.

    except UiPathInputMockingError:
        raise
    except Exception as e:
        raise UiPathInputMockingError(f"Failed to generate input: {str(e)}") from e
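One possible way to keep the parsing failure distinguishable, sketched under assumptions: `UiPathInputMockingError` is redefined locally as a stand-in, and `generate_llm_input_sketch` with its `generate` callable is hypothetical, not the function in the PR.

```python
import json
from typing import Any, Callable


class UiPathInputMockingError(Exception):
    """Local stand-in for the real exception class."""


def generate_llm_input_sketch(generate: Callable[[], Any]) -> Any:
    """Run `generate` (a hypothetical LLM call) and translate its errors."""
    try:
        return generate()
    except UiPathInputMockingError:
        raise
    except json.JSONDecodeError as e:
        # Handled before the generic branch so the message stays specific.
        raise UiPathInputMockingError(
            f"Failed to parse LLM response as JSON: {e}"
        ) from e
    except Exception as e:
        raise UiPathInputMockingError(f"Failed to generate input: {e}") from e
```

Because `json.JSONDecodeError` is a subclass of `Exception`, its handler has to come before the generic one for the specific message to be used.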
"""Provider-agnostic structured output for the eval mockers.

The normalized LLM Gateway honors OpenAI-style ``response_format`` (json_schema)
only for OpenAI models, and does so reliably, including native ``$defs``
support. Non-OpenAI providers (Anthropic/Claude via Bedrock, Gemini) return such
requests with ``choices[0].message.content`` empty/None, which breaks JSON
parsing. Function calling is honored across providers but is less reliable for
OpenAI on some schemas, so it is used only as a fallback: prefer
``response_format`` and fall back to a forced tool call when the content comes
back empty.
  presence_penalty: float = 0,
  top_p: float | None = 1,
  top_k: int | None = None,
- tools: list[ToolDefinition] | None = None,
+ tools: list[ToolDefinition | dict[str, Any]] | None = None,
  tool_choice: ToolChoice | None = None,
  response_format: dict[str, Any] | type[BaseModel] | None = None,
…models work

Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini). The mockers requested structured output via OpenAI-only `response_format` json_schema and parsed `choices[0].message.content`; for Claude that content is empty/None, so `json.loads(...)` raised.

Switch both mockers to provider-agnostic function calling (mirrors llm_as_judge_evaluator): build a forced tool that wraps the output/input schema under a `response` property, force it via tool_choice, and read `tool_calls[0].arguments["response"]` (already a parsed dict). Hoist nested `$defs` to the tool-parameters root so `$ref`s from nested Pydantic models still resolve. The normalized LLM gateway now accepts raw-dict tools so arbitrary nested schemas survive (the ToolDefinition converter only emits flat properties).

Regression introduced by #1555, which started routing the agent's model into simulations; before that, simulation always used a fixed OpenAI model, so non-OpenAI providers were never exercised on this path.

Fixes AE-1646.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
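The wrap-and-extract step in this commit message can be sketched with plain dictionaries. The helper names `build_forced_tool` and `extract_response`, and the exact dictionary layout of the tool and of the response message, are assumptions for illustration; the gateway's real wire format may differ, and `$defs` handling is left out here.

```python
from typing import Any


def build_forced_tool(name: str, schema: dict[str, Any]) -> dict[str, Any]:
    """Wrap the target schema under a single required `response` property."""
    return {
        "name": name,
        "description": "Return the structured result.",
        "parameters": {
            "type": "object",
            "properties": {"response": schema},
            "required": ["response"],
        },
    }


def extract_response(message: dict[str, Any]) -> Any:
    """Read tool_calls[0].arguments["response"], which is already parsed."""
    tool_calls = message.get("tool_calls") or []
    if not tool_calls:
        raise ValueError("model did not call the forced tool")
    return tool_calls[0]["arguments"]["response"]
```

Wrapping under `response` means the tool parameters are always an object, even when the target schema itself is an array or a scalar.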
The mocker fix in uipath depends on the dict-tool passthrough in uipath-platform, so uipath's lower-bound pin is raised to 0.1.62.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ation

The normalized gateway accepts $ref/$defs in response_format but not inside a tool's parameters. Tool outputs typed as nested Pydantic models/enums (e.g. calculator's get_random_operator -> Wrapper[Operator]) produced a tool schema with $ref/$defs that the gateway rejected, so simulation failed. Inline the definitions into a self-contained schema (cyclic refs keep their $defs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
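A minimal sketch of what such inlining could look like, assuming Pydantic-style schemas where all definitions live in a root-level `$defs` and are referenced as `#/$defs/Name`. This `inline_defs` is written for illustration and is not the PR's `_inline_defs`; it also merges sibling keys next to a `$ref` over the inlined definition, and on any cyclic or unknown reference it keeps the reference and retains the whole `$defs` block.

```python
import copy
from typing import Any

_PREFIX = "#/$defs/"


def inline_defs(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a new schema with local `#/$defs/...` references inlined."""
    defs = schema.get("$defs", {})
    kept_refs: set[str] = set()  # names left as $ref (cyclic or unknown)

    def resolve(node: Any, stack: tuple[str, ...]) -> Any:
        if isinstance(node, list):
            return [resolve(item, stack) for item in node]
        if not isinstance(node, dict):
            return node
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(_PREFIX):
            name = ref[len(_PREFIX):]
            if name in stack or name not in defs:
                kept_refs.add(name)
                return copy.deepcopy(node)  # cannot inline: keep the ref
            target = resolve(defs[name], stack + (name,))
            siblings = {
                key: resolve(value, stack)
                for key, value in node.items()
                if key not in ("$ref", "$defs")
            }
            return {**target, **siblings}  # sibling keys win over the def
        return {k: resolve(v, stack) for k, v in node.items() if k != "$defs"}

    result = resolve(schema, ())
    if kept_refs and defs:
        # Simplification: keep every definition so remaining refs resolve.
        result["$defs"] = copy.deepcopy(defs)
    return result
```

The function builds new containers at every level, so the caller's schema is left untouched. A property literally named `$defs` would be dropped by this sketch, which a production version would need to handle.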
All-tool-calling regressed OpenAI tool simulation (calculator-evals 'Test Random Addition Using LLM' became flaky: gpt_4_1_mini returned wrong/empty values for a nested-enum output schema via function calling, where response_format was reliable).

Make structured-output generation adaptive: prefer response_format (honored reliably by OpenAI, native $defs support) and fall back to a forced tool call only when content comes back empty (the non-OpenAI failure mode, e.g. Claude/Bedrock). Shared in generate_structured_output(), used by both mockers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ON prose

Live-verified on alpha: Claude answers response_format requests with plain prose (not empty content), so the empty-content check alone never triggered the fallback and AE-1646 persisted for Claude.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
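The fallback condition after this commit covers both empty content and non-JSON prose. A small sketch of that decision follows; `parse_json_object`, `generate_with_fallback`, and the two injected callables are hypothetical names, and requiring the parsed value to be a JSON object is an assumption of this example.

```python
import json
from typing import Any, Callable, Optional


def parse_json_object(content: Optional[str]) -> Optional[dict[str, Any]]:
    """Return the parsed object, or None when content is empty or not JSON."""
    if not content or not content.strip():
        return None  # empty or missing content
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return None  # plain prose instead of JSON
    # Assumption: only a JSON object counts as usable structured output.
    return parsed if isinstance(parsed, dict) else None


def generate_with_fallback(
    ask_response_format: Callable[[], Optional[str]],
    ask_forced_tool: Callable[[], dict[str, Any]],
) -> dict[str, Any]:
    """Prefer the response_format answer; otherwise use the forced tool call."""
    parsed = parse_json_object(ask_response_format())
    if parsed is not None:
        return parsed
    return ask_forced_tool()
```

The second request is only issued when the first answer is unusable, so a model that honors `response_format` still costs a single call.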
Address review feedback: separate classes per provider (OpenAI prefers response_format with tool fallback; Claude and Gemini go straight to a forced tool call, avoiding a known-dead request per simulation). Merge $ref sibling keys when inlining so field descriptions survive, and document raw-dict tools in chat_completions.

Live-verified on alpha for gpt-4.1-mini, claude-sonnet-4-5 and gemini-2.5-pro: tool simulation (nested $defs schema) and input generation pass end-to-end through the @mockable pipeline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Force-pushed from 5bddc0f to a8a3307.
Summary
Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with `AGENT_RUNTIME.UNEXPECTED_ERROR` for non-OpenAI models (Anthropic Claude via Bedrock, Gemini), but worked with OpenAI/GPT.

Fixes AE-1646 (customer: Sarasota Memorial Health Care System).
Root cause (live-verified on alpha)
Both eval mockers requested structured output via OpenAI-only `response_format` json_schema and parsed `response.choices[0].message.content`. The normalized LLM Gateway only honors `response_format` for OpenAI models, and the failure shape differs per provider (verified by live calls, not assumed):

| Provider | `response_format` behavior on the gateway |
|---|---|
| OpenAI | Honored, including native `$defs` support |
| Anthropic Claude | Returns plain prose instead of JSON (e.g. `'Tokyo'`) |
| Gemini | Returns empty content |

Both non-OpenAI shapes broke `json.loads(...)` → wrapped as `UiPathMockResponseGenerationError` → `AGENT_RUNTIME.UNEXPECTED_ERROR`. Note the prose case matters: an earlier revision of this PR fell back only on empty content, which fixed Gemini but left Claude, the customer's model, still failing. The live e2e below caught that.

Regression from #1555, which started routing the agent's model into simulations; before that, simulation always used a fixed OpenAI model.
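The failure shapes described above can be reproduced with the standard library alone. The helper name `failure_message` is made up for this illustration; it only reports how `json.loads` reacts to each kind of content.

```python
import json
from typing import Optional


def failure_message(content: Optional[str]) -> Optional[str]:
    """Return the error json.loads raises for `content`, or None if it parses."""
    try:
        json.loads(content)  # type: ignore[arg-type]
    except (json.JSONDecodeError, TypeError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


# Prose and empty strings fail with the same JSONDecodeError message;
# None fails earlier with a TypeError; valid JSON returns None here.
for sample in ("Tokyo", "", None, '{"total": 42.0}'):
    print(repr(sample), "->", failure_message(sample))
```

Both the prose and the empty-string case produce `Expecting value: line 1 column 1 (char 0)`, so the error text alone does not reveal which provider behavior triggered it.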
Fix
Provider-aware strategy classes in `eval/mocks/_structured_output.py` (mirroring `llm_as_judge_evaluator`, whose docstring already states function calling is the cross-provider way to get structured output):
- `OpenAIStructuredOutput`: prefers `response_format` (reliable for OpenAI incl. `$defs`), falls back to a forced tool call on empty/non-JSON content or request error.
- `AnthropicStructuredOutput` / `GeminiStructuredOutput`: go straight to a forced tool call (their `response_format` is known-broken on the gateway; no wasted request per simulation).
- Unknown models: `response_format` first with tool fallback (safe default).

The forced tool wraps the output/input schema under a `response` property (`tool_choice=required`), and reads `tool_calls[0].arguments["response"]` (already a parsed dict).

`$defs`/`$ref` are inlined so tool parameters are self-contained (the gateway accepts `$defs` in `response_format` but not in tool parameters); sibling keys on `$ref` nodes (e.g. field descriptions) are merged over the inlined definition so LLM guidance survives.

`chat_completions` now accepts raw-dict tools (pass-through) so arbitrary nested schemas reach the gateway verbatim.

E2E validation (live against alpha gateway)
Full customer path per model: `@mockable` → `Mocker.response` → `generate_structured_output` → real gateway → Pydantic coercion of the result. Tool simulation used a nested return model (enum + list of sub-models ⇒ real `$defs`/`$ref`); input generation ran `generate_llm_input` the same way.

Each of the three recorded results reads `Calculation, total=42.0`.

Negative control: reverting the non-JSON-content fallback and re-running the Claude e2e reproduces the exact customer error (`UiPathMockResponseGenerationError: Expecting value: line 1 column 1 (char 0)`), confirming the e2e discriminates.

Tests
- … `response_format`; OpenAI → `response_format` preferred; unknown → fallback chain (incl. prose-content, empty-content, and request-error triggers).
- `$defs` inlining: self-contained output, sibling-key merge, cyclic-ref handling, caller schema not mutated.
- `test_raw_dict_tool_passthrough_mocked` (platform): nested array schema forwarded byte-for-byte.
- `tests/cli/eval` suite (405) + platform mocked LLM tests green; ruff + mypy clean on both packages.

Review feedback addressed
- … `OpenAIStructuredOutput`, `AnthropicStructuredOutput`, `GeminiStructuredOutput`, shared `ToolCallStructuredOutput` base).
- `chat_completions` docstring updated for raw-dict tools.

🤖 Generated with Claude Code